Preprocessing
There are several steps followed for preprocessing the data.
-
First, the dataset is completed. If the dataframe extracted from MIO is incomplete, i.e., contains only rows with non-zero sales and no rows with 0 sales, thus having a lot of missing dates, the missing rows are added to the dataframe for each brand.
All dates from a certain starting point (usually 2018-01-01) to the end of desired forecasting period (usually 2025-02-01) are added.
import pandas as pd
all_dates = pd.date_range(start="2018-01-01", end="2025-02-01", freq="D")
all_brand_names = df["brand"].unique()
all_combinations = pd.DataFrame(
[(date, brand)
for date in all_dates
for brand in all_brand_names],
columns=["date", "brand"]
)
df = pd.merge(all_combinations, df, on=["date", "brand"], how='left')
df.fillna(0, inplace=True)
df = df[df["brand"] != 0]Missing values (for quantity delivered) are filled with 0 and if any brand is originally NaN and has been filled with 0 too, it is removed.
-
Next, the data is aggregated to monthly.
df["date"] = df["date"].dt.to_period("M").dt.to_timestamp()
df = df.groupby(["brand", "date"])["quantity_delivered"].sum().reset_index()
product_values = df[(df["date"] >= "2023-01-01") & (df["date"] <= "2023-12-01")].groupby("brand")["quantity_delivered"].sum()
discontinued_products = product_values.loc[product_values.values < 1]
df = df[~df["brand"].isin(discontinued_products.index)]
df.reset_index(drop=True, inplace=True)Furthermore, brands which have been discontinued in the region (no sales in 2023) are removed.
-
Weather data is merged with the dataframe optionally.
weather_df["date"] = pd.to_datetime(weather_df["date"])
weather_df["date"] = weather_df["date"].dt.to_period("M").dt.to_timestamp()
weather_df = weather_df.groupby(["date"]).mean().reset_index()
weather_df_date_max = weather_df["date"].max()
previous_year_start = (weather_df_date_max - relativedelta(years=1)).strftime('%Y-%m-%d')
previous_year_end = (datetime.strptime("2025-02-01", '%Y-%m-%d') - relativedelta(years=1)).strftime('%Y-%m-%d')
weather_df_features = list(weather_df.columns)
weather_df_features.remove("date")
df = pd.merge(df, weather_df, on="date", how='left')
while df[df["date"] > weather_df_date_max][weather_df_features].isna().sum().sum() > 0:
df.loc[df["date"] > weather_df_date_max, weather_df_features] = df.loc[(df["date"] > previous_year_start) & (df["date"] <= previous_year_end), weather_df_features].values
df.fillna(0, inplace=True)The weather data for future time periods is filled with weather data of the previous year.